Abstract

We generalize Karp-Rabin string matching to handle multiple patterns in O(n log n + m) time and
O(s) space, where n is the length of the text and m is the total length of the s patterns, returning correct
answers with high probability. As a prime application of our algorithm, we show how to approximate the
LZ77 parse of a string of length n. If the optimal parse consists of z phrases, using only O(z) working
space we can return a parse consisting of at most (1 + ε)z phrases in O(ε^{-1} n log n) time, for any ε ∈ (0,1].
As previous quasilinear-time algorithms for LZ77 use Ω(n/poly log n) space, but z can be exponentially
small in n, these improvements in space are substantial.


1 Introduction 

Multiple-pattern matching, the task of locating the occurrences of s patterns of total length m in a single text 
of length n, is a fundamental problem in the field of string algorithms. The algorithm by Aho and Corasick [2] 
solves this problem using O(n + m) time and O(m) working space in addition to the space needed for the text
and patterns. To list all occ occurrences rather than, e.g., the leftmost ones, extra O(occ) time is necessary.
When the space is limited, we can use a compressed Aho-Corasick automaton m- In extreme cases, one 
could apply a linear-time constant-space single-pattern matching algorithm sequentially for each pattern in 
turn, at the cost of increasing the running time to O(n · s + m). Well-known examples of such algorithms
include those by Galil and Seiferas [5], Crochemore and Perrin [^, and Karp and Rabin [13] (see [5] for a
recent survey).

It is easy to generalize Karp-Rabin matching to handle multiple patterns in O(n + m) expected time and
O(s) working space provided that all patterns are of the same length [10]. To do this, we store the fingerprints
of the patterns in a hash table, and then slide a window over the text maintaining the fingerprint of the 
fragment currently in the window. The hash table lets us check if the fragment is an occurrence of a pattern. 
If so, we report it and update the hash table so that every pattern is returned at most once. This is a very
simple idea that is used in practice [T], but it is not clear how to extend it to patterns with many distinct
lengths. In this paper we develop a dictionary matching algorithm which works for any set of patterns in
O(n log n + m) time and O(s) working space, assuming that read-only random access to the text and the
patterns is available. If required, we can compute for every pattern its longest prefix occurring in the text,
also in O(n log n + m) time and O(s) working space.

In a very recent independent work Clifford et al. [1] gave a dictionary matching algorithm in the streaming 
model. In this setting the patterns and later the text are scanned once only (as opposed to read-only random 
access) and an occurrence needs to be reported immediately after its last character is read. Their algorithm 
uses O(s log ℓ) space and takes O(log log(s + ℓ)) time per character, where ℓ is the length of the longest
pattern (m/s ≤ ℓ ≤ m). Even though some of the ideas used in both results are similar, one should note that
the streaming and read-only models are quite different. In particular, computing the longest prefix occurring
in the text for every pattern requires Ω(m log min(n, |Σ|)) bits of space in the streaming model, as opposed
to the O(s) working space achieved by our solution in the read-only setting.

As a prime application of our dictionary matching algorithm, we show how to approximate the
Lempel-Ziv 77 (LZ77) parse [18] of a text of length n using working space proportional to the number of phrases
(again, we assume read-only random access to the text). Computing the LZ77 parse in small space is an
issue of high importance, with space being a frequent bottleneck of today’s systems. Moreover, LZ77 is 
useful not only for data compression, but also as a way to speed up algorithms m- We present a general 
approximation algorithm working in O(z) space for inputs admitting LZ77 parsing with z phrases. For any
ε ∈ (0,1], the algorithm can be used to produce a parse consisting of (1 + ε)z phrases in O(ε^{-1} n log n) time.

To the best of our knowledge, approximating LZ77 factorization in small space has not been considered
before, and our algorithm is significantly more efficient than methods producing the exact answer. A recent
sublinear-space algorithm, due to Kärkkäinen et al. [12], runs in O(nd) time and uses O(n/d) space, for any
parameter d. An earlier online solution by Gąsieniec et al. [9] uses O(z) space and takes O(z^2 log^2 z) time
for each character appended. Other previous methods use significantly more space when the parse is small
relative to n; see [7] for a recent discussion.

Structure of the paper. Sect. 2 introduces terminology and recalls several known concepts. This is
followed by the description of our dictionary matching algorithm. In Sect. 3 we show how to process
patterns of length at most s and in Sect. 4 we handle longer patterns, with different procedures for repetitive
and non-repetitive ones. In Sect. 5 we extend the algorithm to compute, for every pattern, the longest
prefix occurring in the text. Finally, in Sect. 6 we apply the dictionary matching algorithm to construct an
approximation of the LZ77 parsing, and in Sect. 7 we explain how to modify the algorithms to make them
Las Vegas.

Model of computation. Our algorithms are designed for the word-RAM with Ω(log n)-bit words and
assume an integer alphabet of polynomial size. The usage of Karp-Rabin fingerprints makes them Monte
Carlo randomized: the correct answer is returned with high probability, i.e., the error probability is inverse
polynomial with respect to input size, where the degree of the polynomial can be set arbitrarily large. With
some additional effort, our algorithms can be turned into Las Vegas randomized, where the answer is always
correct and the time bounds hold with high probability. Throughout the whole paper, we assume read-only
random access to the text and the patterns, and we do not include their sizes while measuring space
consumption.


2 Preliminaries 

We consider finite words over an integer alphabet Σ = {0, ..., σ − 1}, where σ = poly(n + m). For a word
w = w[1] ... w[n] ∈ Σ^n, we define the length of w as |w| = n. For 1 ≤ i ≤ j ≤ n, a word u = w[i] ... w[j]
is called a subword of w. By w[i..j] we denote the occurrence of u at position i, called a fragment of w. A
fragment with i = 1 is called a prefix and a fragment with j = n is called a suffix.

A positive integer p is called a period of w whenever w[i] = w[i + p] for all i = 1, 2, ..., |w| − p. In this
case, the prefix w[1..p] is often also called a period of w. The length of the shortest period of a word w is
denoted as per(w). A word w is called periodic if per(w) ≤ |w|/2 and highly periodic if per(w) ≤ |w|/3. The
well-known periodicity lemma [B] says that if p and q are both periods of w, and p + q ≤ |w|, then gcd(p, q)
is also a period of w. We say that word w is primitive if per(w) is not a proper divisor of |w|. Note that the
shortest period w[1..per(w)] is always primitive.
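The definitions of this section can be made concrete with a small brute-force sketch. It is only an illustration and is not part of the algorithms of this paper: the function names are hypothetical, positions are 0-indexed, and the thresholds |w|/2 and |w|/3 are the ones assumed for "periodic" and "highly periodic" words.

```python
def shortest_period(w: str) -> int:
    """Return per(w): the smallest p >= 1 with w[i] == w[i + p] for all valid i.

    Quadratic brute force, meant only to illustrate the definition.
    """
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return n  # only reached for the empty word


def is_periodic(w: str) -> bool:
    # Assumed threshold: per(w) <= |w| / 2.
    return 2 * shortest_period(w) <= len(w)


def is_highly_periodic(w: str) -> bool:
    # Assumed threshold: per(w) <= |w| / 3.
    return 3 * shortest_period(w) <= len(w)
```

For instance, "abab" has shortest period 2, so under these thresholds it is periodic but not highly periodic, whereas "ababab" is highly periodic.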

2.1 Fingerprints 

Our randomized construction is based on Karp-Rabin fingerprints; see [13]. Fix a word w[1..n] over an
alphabet Σ = {0, ..., σ − 1}, a constant c ≥ 1, a prime number p > max(σ, n^{c+4}), and choose x ∈ Z_p uniformly
at random. We define the fingerprint of a subword w[i..j] as Φ(w[i..j]) = w[i] + w[i+1]x + ... + w[j]x^{j−i} mod p.
With probability at least 1 − 1/n^c, no two distinct subwords of the same length have equal fingerprints. The
situation when this happens for some two subwords is called a false-positive. From now on when stating the
results we assume that there are no false-positives to avoid repeating that the answers are correct with high
probability. For dictionary matching, we assume that no two distinct subwords of w = T P_1 ... P_s have equal
fingerprints. Fingerprints let us easily locate many patterns of the same length. A straightforward solution
described in the introduction builds a hash table mapping fingerprints to patterns. However, then we can
only guarantee that the hash table is constructed correctly with probability 1 − O(1/s^c) (for an arbitrary
constant c), and we would like to bound the error probability by O(1/(n + m)^c). Hence we replace the hash table
with a deterministic dictionary as explained below. Although it increases the time by O(s log s), the extra
term becomes absorbed in the final complexities.

Theorem 1. Given a text T of length n and patterns P_1, ..., P_s, each of length exactly ℓ, we can compute
the leftmost occurrence of every pattern P_i in T using O(n + sℓ + s log s) total time and O(s) space.

Proof. We calculate the fingerprint Φ(P_j) of every pattern. Then we build in O(s log s) time [TB] a
deterministic dictionary 𝒟 with an entry mapping Φ(P_j) to j. For multiple identical patterns we create just
one entry, and at the end we copy the answers to all instances of the pattern. Then we scan the text T
with a sliding window of length ℓ while maintaining the fingerprint Φ(T[i..i + ℓ − 1]) of the current window.
Using 𝒟, we can find in O(1) time an index j such that Φ(T[i..i + ℓ − 1]) = Φ(P_j), if any, and update the
answer for P_j if needed (i.e., if there was no occurrence of P_j before). If we precompute x^{−1}, the fingerprints
Φ(T[i..i + ℓ − 1]) can be updated in O(1) time while increasing i.

□ 
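The scan from the proof can be sketched as follows. This is a simplified illustration under stated assumptions, not the construction of the theorem: a Python dict stands in for the deterministic dictionary, the prime 2^61 − 1 and the seeded random choice of x are arbitrary choices made for the example, and the function name is hypothetical. Positions are 0-indexed and, as in the text, answers may be wrong with small probability because only fingerprints are compared.

```python
import random


def leftmost_occurrences(text, patterns, p=(1 << 61) - 1, seed=0):
    """Leftmost 0-indexed occurrence (or None) of each equal-length pattern."""
    ell = len(patterns[0])
    assert all(len(q) == ell for q in patterns)
    x = random.Random(seed).randrange(1, p)

    def fingerprint(s):
        # s[0] + s[1]*x + ... + s[-1]*x^(len(s)-1) mod p
        h, power = 0, 1
        for ch in s:
            h = (h + ord(ch) * power) % p
            power = power * x % p
        return h

    # A plain dict replaces the deterministic dictionary (assumption).
    table = {}
    for j, q in enumerate(patterns):
        table.setdefault(fingerprint(q), []).append(j)

    answer = [None] * len(patterns)
    n = len(text)
    if n < ell:
        return answer
    x_inv = pow(x, p - 2, p)       # x^{-1} mod p by Fermat's little theorem
    x_top = pow(x, ell - 1, p)     # weight of the last window character
    h = fingerprint(text[:ell])
    for i in range(n - ell + 1):
        for j in table.get(h, ()):
            if answer[j] is None:
                answer[j] = i
        if i + ell < n:
            # Drop text[i], divide by x, append text[i + ell].
            h = ((h - ord(text[i])) * x_inv + ord(text[i + ell]) * x_top) % p
    return answer
```

The window update uses the precomputed inverse of x exactly as suggested at the end of the proof, so each step of the scan costs O(1) arithmetic operations.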


2.2 Tries 

A trie of a collection of strings P_1, ..., P_s is a rooted tree whose nodes correspond to prefixes of the strings.
The root represents the empty word and the edges are labeled with single characters. The node corresponding
to a particular prefix is called its locus. In a compacted trie unary nodes that do not represent any P_i are
dissolved and the labels of their incident edges are concatenated. The dissolved nodes are called implicit as
opposed to the explicit nodes, which remain stored. The locus of a string in a compacted trie might therefore
be explicit or implicit. All edges outgoing from the same node are stored on a list sorted according to the
first character, which is unique among these edges. The labels of edges of a compacted trie are stored as
pointers to the respective fragments of strings P_i. Consequently, a compacted trie can be stored in space
proportional to the number of explicit nodes, which is O(s).

Consider two compacted tries 𝒯_1 and 𝒯_2. We say that (possibly implicit) nodes v_1 ∈ 𝒯_1 and v_2 ∈ 𝒯_2 are
twins if they are loci of the same string. Note that every v_1 ∈ 𝒯_1 has at most one twin v_2 ∈ 𝒯_2.

Lemma 2. Given two compacted tries 𝒯_1 and 𝒯_2 constructed for s_1 and s_2 strings, respectively, in O(s_1 + s_2)
total time and space we can find for each explicit node v_1 ∈ 𝒯_1 a node v_2 ∈ 𝒯_2 such that if v_1 has a twin in
𝒯_2, then v_2 is its twin. (If v_1 has no twin in 𝒯_2, the algorithm returns an arbitrary node v_2 ∈ 𝒯_2).

Proof. We recursively traverse both tries while maintaining a pair of nodes v_1 ∈ 𝒯_1 and v_2 ∈ 𝒯_2, starting
with the roots of 𝒯_1 and 𝒯_2, satisfying the following invariant: either v_1 and v_2 are twins, or v_1 has no twin in
𝒯_2. If v_1 is explicit, we store v_2 as the candidate for its twin. Next, we list the (possibly implicit) children
of v_1 and v_2 and match them according to the edge labels with a linear scan. We recurse on all pairs of
matched children. If both v_1 and v_2 are implicit, we simply advance to their immediate children. The last
step is repeated until we reach an explicit node in at least one of the tries, so we keep it implicit in the
implementation to make sure that the total number of operations is O(s_1 + s_2). If a node v ∈ 𝒯_1 is not
visited during the traversal, for sure it has no twin in 𝒯_2. Otherwise, we compute a single candidate for its
twin. □


3 Short Patterns 

To handle the patterns of length not exceeding a given threshold ℓ, we first build a compacted trie for those
patterns. Construction is easy if the patterns are sorted lexicographically: we insert them one by one into 
the compacted trie first naively traversing the trie from the root, then potentially partitioning one edge into 
two parts, and finally adding a leaf if necessary. Thus, the following result suffices to efficiently build the 
tries. 

Lemma 3. One can lexicographically sort strings P_1, ..., P_s of total length m in O(m + σ^ε) time using O(s)
space, for any constant ε > 0.

Proof. We separately sort the √m + σ^{ε/2} longest strings and all the remaining strings, and then merge both
sorted lists. Note these longest strings can be found in O(s) time using a linear time selection algorithm.

Long strings are sorted using insertion sort. If the longest common prefixes between adjacent (in the
sorted order) strings are computed and stored, inserting P_j can be done in O(j + |P_j|) time. In more detail,
let S_1, S_2, ..., S_{j−1} be the sorted list of already processed strings. We start with k := 1 and keep increasing
k by one as long as S_k is lexicographically smaller than P_j while maintaining the longest common prefix
between S_k and P_j, denoted ℓ. After increasing k by one, we update ℓ using the longest common prefix
between S_{k−1} and S_k, denoted ℓ', as follows. If ℓ' > ℓ, we keep ℓ unchanged. If ℓ' = ℓ, we try to iteratively
increase ℓ by one as long as possible. In both cases, the new value of ℓ allows us to lexicographically compare
S_k and P_j in constant time. Finally, ℓ' < ℓ guarantees that P_j < S_k and we may terminate the procedure.
Sorting the √m + σ^{ε/2} longest strings using this approach takes O(m + (√m + σ^{ε/2})^2) = O(m + σ^ε) time.

The remaining strings are of length at most √m each, and if there are any, then s ≥ σ^{ε/2}. We sort these
strings by iteratively applying radix sort, treating each symbol from Σ as a sequence of 2/ε symbols from
{0, 1, ..., σ^{ε/2} − 1}. Then a single radix sort takes time and space proportional to the number of strings
involved plus the alphabet size, which is O(s + σ^{ε/2}) = O(s). Furthermore, because the numbers of strings
involved in the subsequent radix sorts sum up to m, the total time complexity is O(m + σ^{ε/2} √m) = O(m + σ^ε).

Finally, the merging takes time linear in the sum of the lengths of all the involved strings, so the total
complexity is as claimed. □

Next, we partition T into O(n/ℓ) overlapping blocks T_1 = T[1..2ℓ], T_2 = T[ℓ+1..3ℓ], T_3 = T[2ℓ+1..4ℓ], ....
Notice that each subword of length at most ℓ is completely contained in some block. Thus, we can consider
every block separately.
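The block decomposition is easy to state in code. The following sketch uses a hypothetical helper name and returns the 1-indexed inclusive endpoints of the blocks, truncating the last block at n (an assumption about how the boundary is handled).

```python
def blocks(n: int, ell: int):
    """Endpoints (start, end), 1-indexed and inclusive, of the overlapping
    blocks T[1..2l], T[l+1..3l], T[2l+1..4l], ... covering a text of length n."""
    result = []
    start = 1
    while True:
        end = min(start + 2 * ell - 1, n)  # the last block is cut off at n
        result.append((start, end))
        if end == n:
            break
        start += ell
    return result
```

Consecutive blocks overlap by ℓ characters, which is what guarantees that a fragment of length at most ℓ never straddles a block boundary without being fully contained in one of the two blocks.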

The suffix tree of each block T_i takes O(ℓ log ℓ) time [17] and O(ℓ) space to construct and store (the suffix
tree is discarded after processing the block). We apply Lemma 2 to the suffix tree and the compacted trie
of patterns; this takes O(ℓ + s) time. For each pattern P_j we obtain a node such that the corresponding
subword is equal to P_j provided that P_j occurs in T_i. We compute the leftmost occurrence T_i[b..e] of the
subword, which takes constant time if we store additional data at every explicit node of the suffix tree,
and then we check whether T_i[b..e] = P_j using fingerprints. For this, we precompute the fingerprints of all
patterns, and for each block T_i we precompute the fingerprints of its prefixes in O(ℓ) time and space, which
allows us to determine the fingerprint of any of its subwords in constant time.

In total, we spend O(m + σ^ε) for preprocessing and O(ℓ log ℓ + s) for each block. Since σ = (n + m)^{O(1)},
for small enough ε this yields the following result.

Theorem 4. Given a text T of length n and patterns P_1, ..., P_s of total length m, using O(n log ℓ + s n/ℓ + m)
total time and O(s + ℓ) space we can compute the leftmost occurrences in T of every pattern P_j of length at
most ℓ.
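The processing of a block relies on obtaining the fingerprint of any of its subwords in constant time from precomputed prefix fingerprints. A minimal sketch of that step is given below; the class name, the choice of storing powers of x^{−1} in a table, and the concrete x and p in the usage are assumptions made for illustration. Indices are 0-indexed and half-open.

```python
class PrefixFingerprints:
    """Prefix fingerprints of a block, for O(1) fingerprints of its subwords."""

    def __init__(self, block: str, x: int, p: int):
        self.p = p
        self.prefix = [0]    # prefix[k] = sum of block[t] * x^t for t < k (mod p)
        self.inv_pow = [1]   # inv_pow[k] = x^{-k} mod p
        x_inv = pow(x, p - 2, p)  # inverse of x, valid because p is prime
        power = 1
        for ch in block:
            self.prefix.append((self.prefix[-1] + ord(ch) * power) % p)
            power = power * x % p
            self.inv_pow.append(self.inv_pow[-1] * x_inv % p)

    def fingerprint(self, b: int, e: int) -> int:
        """Fingerprint of block[b:e], normalised so that block[b] has weight 1."""
        return (self.prefix[e] - self.prefix[b]) * self.inv_pow[b] % self.p


def direct_fingerprint(s: str, x: int, p: int) -> int:
    """Reference computation: s[0] + s[1]*x + ... mod p."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % p
        power = power * x % p
    return h
```

Subtracting two prefix fingerprints and multiplying by x^{−b} removes the shift introduced by the starting position b, so the result agrees with the fingerprint computed directly from the subword.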


4 Long Patterns 

To handle patterns longer than a certain threshold, we first distribute them into groups according to the
value of ⌊log_{4/3} |P_j|⌋. Patterns longer than the text can be ignored, so there are O(log n) groups. Each
group is handled separately, and from now on we consider only patterns P_j satisfying ⌊log_{4/3} |P_j|⌋ = i.
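The grouping step can be sketched with exact rational arithmetic, which avoids floating-point errors in the logarithm. The helper names are hypothetical, and the window length ⌈(4/3)^i⌉ associated with group i is an assumption about the intended rounding; with this choice every pattern of the group has length at least ℓ and less than (4/3)ℓ.

```python
from fractions import Fraction
from math import ceil


def group_index(length: int) -> int:
    """Largest i >= 0 with (4/3)^i <= length, i.e. floor(log_{4/3} length)."""
    i, power = 0, Fraction(1)
    while power * Fraction(4, 3) <= length:
        power *= Fraction(4, 3)
        i += 1
    return i


def window_length(i: int) -> int:
    # Assumed rounding: l = ceil((4/3)^i).
    return ceil(Fraction(4, 3) ** i)


def group_patterns(patterns):
    """Map each group index to the list of indices of patterns in that group."""
    groups = {}
    for j, pattern in enumerate(patterns):
        groups.setdefault(group_index(len(pattern)), []).append(j)
    return groups
```

Since only patterns of length at most n matter, at most O(log n) distinct group indices arise.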

We classify the patterns into classes depending on the periodicity of their prefixes and suffixes. We
set ℓ = ⌈(4/3)^i⌉ and define α_j and β_j as, respectively, the prefix and the suffix of length ℓ of P_j. Since
|α_j| + |β_j| = 2ℓ ≥ |P_j| + ℓ/3, the following fact yields a classification of the patterns into three classes: either
P_j is highly periodic, or α_j is not highly periodic, or β_j is not highly periodic. The intuition behind this
classification is that if the prefix or the suffix is not repetitive, then we will not see it many times in a short
subword of the text. On the other hand, if both the prefix and suffix are repetitive, then there is some
structure that we can take advantage of.

Fact 5. Suppose x and y are a prefix and a suffix of a word w, respectively. If |x| + |y| ≥ |w| + p and p is a
period of both x and y, then p is a period of w.

Proof. We need to prove that w[i] = w[i + p] for all i = 1, 2, ..., |w| − p. If i + p ≤ |x| this follows from p
being a period of x, and if i ≥ |w| − |y| + 1 from p being a period of y. Because |x| + |y| ≥ |w| + p, these two
cases cover all possible values of i. □

To assign every pattern to the appropriate class, we compute the periods of P_j, α_j and β_j using small
space. Roughly the same result has been proved in [14], but for completeness we provide the full proof here.

Lemma 6. Given a read-only string w one can decide in O(|w|) time and constant space if w is periodic
and if so, compute per(w).

Proof. Let v be the prefix of w of length ⌈|w|/2⌉ and p be the starting position of the second occurrence of v
in w, if any. We claim that if per(w) ≤ |w|/2, then per(w) = p − 1. Observe first that in this case v occurs
at a position per(w) + 1. Hence, per(w) ≥ p − 1. Moreover p − 1 is a period of w[1..|v| + p − 1] along with
per(w). By the periodicity lemma, per(w) ≤ |w|/2 ≤ |v| implies that gcd(p − 1, per(w)) is also a period of
that prefix. Thus per(w) > p − 1 would contradict the primitivity of w[1..per(w)].

The algorithm computes the position p using a linear time constant-space pattern matching algorithm.
If it exists, it uses letter-by-letter comparison to determine whether w[1..p − 1] is a period of w. If so, by the
discussion above per(w) = p − 1 and the algorithm returns this value. Otherwise, 2per(w) > |w|, i.e., w is
not periodic. The algorithm runs in linear time and uses constant space.

□ 
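The procedure from the proof can be sketched as follows. Python's built-in str.find is used here in place of a linear-time constant-space pattern matching algorithm, so this sketch illustrates the logic only and does not attain the space bound of the lemma; the function name is hypothetical.

```python
def period_if_periodic(w: str):
    """Return per(w) if 2 * per(w) <= |w|, and None otherwise."""
    n = len(w)
    if n == 0:
        return None
    v = w[: (n + 1) // 2]   # prefix of length ceil(n / 2)
    # 0-indexed start of the second occurrence of v; this equals p - 1
    # in the 1-indexed notation of the proof.
    q = w.find(v, 1)
    if q == -1:
        return None
    # Letter-by-letter check that q is a period of the whole word.
    if all(w[i] == w[i + q] for i in range(n - q)):
        return q
    return None
```

For example, for w = "abcabc" the prefix "abc" reoccurs at offset 3 and 3 is a period, so the sketch returns 3, while for "abcab" no second occurrence of "abc" exists and the word is reported as not periodic.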


4.1 Patterns without Long Highly Periodic Prefix 

Below we show how to deal with patterns with non-highly periodic prefixes α_j. Patterns with non-highly
periodic suffixes β_j can be processed using the same method after reversing the text and the patterns.

Lemma 7. Let ℓ be an arbitrary integer. Suppose we are given a text T of length n and patterns P_1, ..., P_s
such that for 1 ≤ j ≤ s we have ℓ ≤ |P_j| < (4/3)ℓ and α_j = P_j[1..ℓ] is not highly periodic. We can compute the
leftmost and the rightmost occurrence of each pattern P_j in T using O(n + s(1 + n/ℓ) log s + sℓ) time and O(s)
space.

The algorithm scans the text T with a sliding window of length ℓ. Whenever it encounters a subword
equal to the prefix α_j of some P_j, it creates a request to verify whether the corresponding suffix β_j of length ℓ
occurs at the appropriate position. The request is processed when the sliding window reaches that position.
This way the algorithm detects the occurrences of all the patterns. In particular, we may store the leftmost
and rightmost occurrence of each pattern.

We use the fingerprints to compare the subwords of T with α_j and β_j. To this end, we precompute Φ(α_j)
and Φ(β_j) for each j. We also build a deterministic dictionary 𝒟 |lf)j with an entry mapping Φ(α_j) to j for
every pattern (if there are multiple patterns with the same value of Φ(α_j), the dictionary maps a fingerprint
to a list of indices). These steps take O(sℓ) and O(s log s), respectively. Pending requests are maintained
in a priority queue Q, implemented using a binary heap,* as pairs containing the pattern index (as a value)
and the position where the occurrence of β_j is anticipated (as a key).


Algorithm 1: Processing patterns with non-highly periodic α_j.

1  for i = 1 to n − ℓ + 1 do
2      h := Φ(T[i..i + ℓ − 1])
3      foreach j : Φ(α_j) = h do
4          add a request (i + |P_j| − ℓ, j) to Q
5      foreach request (i, j) ∈ Q at position i do
6          if h = Φ(β_j) then
7              report an occurrence of P_j at i + ℓ − |P_j|
8          remove (i, j) from Q


Algorithm 1 provides a detailed description of the processing phase. Let us analyze its time and space
complexities. Due to the properties of Karp-Rabin fingerprints, line 2 can be implemented in O(1) time.
Also, the loops in lines 3 and 5 take extra O(1) time even if the respective collections are empty. Apart from
these, every operation can be assigned to a request, each of them taking O(1) time (finding the pattern via
the dictionary and comparing fingerprints) or O(log |Q|) time (inserting the request into Q and removing it
from Q). To bound |Q|, we need to look at the maximum number of pending requests.


Fact 8. For any pattern P_j just O(1 + n/ℓ) requests are created and at any time at most one of them is
pending.


Proof. Note that there is a one-to-one correspondence between requests concerning P_j and the occurrences
of α_j in T. The distance between two such occurrences must be at least ℓ/3, because otherwise the period
of α_j would be at most ℓ/3, thus making α_j highly periodic. This yields the O(1 + n/ℓ) upper bound on the
total number of requests. Additionally, any request is pending for at most |P_j| − ℓ < ℓ/3 iterations of the
main for loop. Thus, the request corresponding to an occurrence of α_j is already processed before the next
occurrence appears. □


Hence, the scanning phase uses O(s) space and takes O(n + s(1 + n/ℓ) log s) time. Taking preprocessing
into account, we obtain the bounds claimed in Lemma 7.
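The scanning phase of Algorithm 1 can be sketched in a few lines. The sketch below is an illustration under stated assumptions rather than the exact implementation analyzed above: a Python dict replaces the deterministic dictionary, heapq provides the binary heap, the prime 2^61 − 1 and the seeded choice of x are arbitrary, the function name is hypothetical, and all positions are 0-indexed. It reports every occurrence; keeping only the leftmost and rightmost one is a trivial change.

```python
import heapq
import random


def find_long_patterns(text, patterns, ell, p=(1 << 61) - 1, seed=0):
    """Occurrences (0-indexed starts) of each pattern, found by matching the
    length-ell prefix first and verifying the length-ell suffix later."""
    n = len(text)
    x = random.Random(seed).randrange(1, p)

    def fingerprint(s):
        h, power = 0, 1
        for ch in s:
            h = (h + ord(ch) * power) % p
            power = power * x % p
        return h

    prefix_index = {}   # fingerprint of alpha_j -> indices j (dict: assumption)
    suffix_fp = []      # fingerprint of beta_j
    for j, pattern in enumerate(patterns):
        prefix_index.setdefault(fingerprint(pattern[:ell]), []).append(j)
        suffix_fp.append(fingerprint(pattern[-ell:]))

    occurrences = {j: [] for j in range(len(patterns))}
    if n < ell:
        return occurrences
    x_inv, x_top = pow(x, p - 2, p), pow(x, ell - 1, p)
    queue = []          # heap of requests (position, pattern index)
    h = fingerprint(text[:ell])
    for i in range(n - ell + 1):
        # Create requests for patterns whose prefix matches the window.
        for j in prefix_index.get(h, ()):
            heapq.heappush(queue, (i + len(patterns[j]) - ell, j))
        # Process the requests scheduled for the current position.
        while queue and queue[0][0] == i:
            _, j = heapq.heappop(queue)
            if h == suffix_fp[j]:
                occurrences[j].append(i + ell - len(patterns[j]))
        if i + ell < n:
            h = ((h - ord(text[i])) * x_inv + ord(text[i + ell]) * x_top) % p
    return occurrences
```

The sketch is correct for arbitrary patterns of length at least ell (up to fingerprint collisions); the assumptions of Lemma 7 are what keep the number of requests, and hence the heap size, small.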


4.2 Highly Periodic Patterns 

Lemma 9. Let ℓ be an arbitrary integer. Given a text T of length n and a collection of highly periodic
patterns P_1, ..., P_s such that for 1 ≤ j ≤ s we have ℓ ≤ |P_j| < (4/3)ℓ, we can compute the leftmost occurrence
of each pattern P_j in T using O(n + s(1 + n/ℓ) log s + sℓ) total time and O(s) space.

The solution is basically the same as in the proof of Lemma 7, except that the algorithm ignores certain
shiftable occurrences. An occurrence of x at position i of T is called shiftable if there is another occurrence
of x at position i − per(x). The remaining occurrences are called non-shiftable. Notice that the leftmost
occurrence is always non-shiftable, so indeed we can safely ignore some of the shiftable occurrences of the
patterns. Because 2per(P_j) ≤ (2/3)|P_j| < (8/9)ℓ < ℓ, the following fact implies that if an occurrence of P_j is
non-shiftable, then the occurrence of α_j at the same position is also non-shiftable.

* Hash tables could be used instead of the heap and the deterministic dictionary. Although this would improve the time
complexity in Lemma 7, the running time of the algorithm in Thm. 12 would not change and failures with probability inverse
polynomial with respect to s would be introduced; see also a discussion before Thm. 1.

Fact 10. Let y be a prefix of x such that |y| ≥ 2per(x). Suppose x has a non-shiftable occurrence at position
i in w. Then, the occurrence of y at position i is also non-shiftable.

Proof. Note that per(y) + per(x) ≤ |y| so the periodicity lemma implies that per(y) = per(x).

Let x = ρ^k ρ' where ρ is the shortest period of x. Suppose that the occurrence of y at position i is
shiftable, meaning that y occurs at position i − per(x). Since |y| ≥ per(x), y occurring at position i − per(x)
implies that ρ occurs at the same position. Thus w[i − per(x)..i + |x| − 1] = ρ^{k+1} ρ'. But then x clearly occurs
at position i − per(x), which contradicts the assumption that its occurrence at position i is non-shiftable. □

Consequently, we may generate requests only for the non-shiftable occurrences of α_j. In other words, if
an occurrence of α_j is shiftable, we do not create the requests and proceed immediately to line 5. To detect
and ignore such shiftable occurrences, we maintain the position of the last occurrence of every α_j. However,
if there are multiple patterns sharing the same prefix α_{j_1} = ... = α_{j_k}, we need to be careful so that the
time to detect a shiftable occurrence is O(1) rather than O(k). To this end, we build another deterministic
dictionary, which stores for each Φ(α_j) a pointer to the variable where we maintain the position of the
previously encountered occurrence of α_j. The variable is shared by all patterns with the same prefix α_j.
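The test for shiftable occurrences based on the last occurrence can be sketched as follows. Two occurrences of x closer than |x| are always at least per(x) apart, so an occurrence at position i is shiftable exactly when the previous occurrence is at i − per(x). The sketch uses exact string comparison and a brute-force period computation instead of fingerprints, and its function names are hypothetical; positions are 0-indexed.

```python
def shortest_period(w: str) -> int:
    """Brute-force per(w), for illustration only."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return n


def non_shiftable_occurrences(text: str, x: str):
    """Positions i where x occurs in text but does not occur at i - per(x)."""
    per = shortest_period(x)
    result = []
    last = None  # position of the most recent occurrence of x
    for i in range(len(text) - len(x) + 1):
        if text.startswith(x, i):
            if last is None or last != i - per:
                result.append(i)
            last = i
    return result
```

For x = "aaa" in the text "aaaaabaaaa" the occurrences start at 0, 1, 2, 6 and 7, but only those at 0 and 6 are non-shiftable, which is why long periodic runs generate few requests.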

It remains to analyze the complexity of the modified algorithm. First, we need to bound the number
of non-shiftable occurrences of a single α_j. Assume that there are non-shiftable occurrences of α_j at positions
i' < i such that i' ≥ i − ℓ/2. Then i − i' ≤ ℓ/2 is a period of T[i'..i + ℓ − 1]. By the periodicity lemma,
per(α_j) divides i − i', and therefore α_j occurs at position i − per(α_j), which contradicts the assumption that
the occurrence at position i is non-shiftable. Consequently, the non-shiftable occurrences of every α_j are at
least ℓ/2 characters apart, and the total number of requests and the maximum number of pending requests
can be bounded by O(s(1 + n/ℓ)) and O(s), respectively, as in the proof of Lemma 7. Taking into account
the time and space to maintain the additional components, which are O(n + s log s) and O(s), respectively,
the final bounds remain the same.


4.3 Summary 

Theorem 11. Given a text T of length n and patterns P_1, ..., P_s of total length m, using
O(n log n + m + s (n/ℓ) log s) total time and O(s) space we can compute the leftmost occurrences in T of every
pattern P_j of length at least ℓ.


Proof. The algorithm distributes the patterns into O(log n) groups according to their lengths, and then into
three classes according to their repetitiveness, which takes O(m) time and O(s) space in total. Then, it
applies either Lemma 7 or Lemma 9 on every class. It remains to show that the running times of all those
calls sum up to the claimed bound. Each of them can be seen as O(n) plus O(|P_j| + (1 + n/|P_j|) log s) per every
pattern P_j. Because ℓ ≤ |P_j| ≤ n and there are O(log n) groups, this sums up to O(n log n + m + s (n/ℓ) log s). □


Using Thm. 4 for all patterns of length at most min(n, s), and (if s < n) Thm. 11 for patterns of length
at least s, we obtain our main theorem.


Theorem 12. Given a text T of length n and patterns P_1, ..., P_s of total length m, we can compute the
leftmost occurrence in T of every pattern P_j using O(n log n + m) total time and O(s) space.


5 Computing Longest Occurring Prefixes 

In this section we extend Thm. 12 to compute, for every pattern P_j, its longest prefix occurring in the text.
A straightforward extension uses binary search to compute the length ℓ_j of the longest prefix of P_j occurring
in T. All binary searches are performed in parallel, that is, we proceed in O(log n) phases. In every phase
we check, for every j, if the prefix of P_j of the currently tested length occurs in T using Thm. 12 and then
update the corresponding search range accordingly. This results in O(n log^2 n + m) total time complexity.
To avoid the logarithmic multiplicative overhead in the running time, we use a more complex approach
requiring a careful modification of all the components developed in Sections 3 and 4.
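The straightforward binary-search extension can be sketched as below. The batch oracle occurs is a hypothetical stand-in for one application of the dictionary matching algorithm to a set of prefixes; by default the sketch answers the queries with Python's substring test, which is an assumption made only to keep the example self-contained.

```python
def longest_occurring_prefixes(text, patterns, occurs=None):
    """For every pattern, the length of its longest prefix occurring in text.

    All binary searches advance together: each phase issues one batch of
    queries, one prefix per pattern.
    """
    if occurs is None:
        # Placeholder oracle (assumption): exact substring tests.
        occurs = lambda words: [word in text for word in words]

    lo = [0] * len(patterns)                    # prefix of length lo[j] occurs
    hi = [len(pattern) for pattern in patterns]  # no longer prefix can occur
    while any(l < h for l, h in zip(lo, hi)):
        mids = [(l + h + 1) // 2 for l, h in zip(lo, hi)]
        answers = occurs([pat[:mid] for pat, mid in zip(patterns, mids)])
        for j, found in enumerate(answers):
            if lo[j] < hi[j]:
                if found:
                    lo[j] = mids[j]
                else:
                    hi[j] = mids[j] - 1
    return lo
```

The search is valid because occurrence is monotone: if a prefix occurs in the text, so does every shorter prefix. Each phase halves every active range, so the number of phases is logarithmic in the longest pattern length.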


Short patterns. We proceed as in Sect. 3 while maintaining a tentative longest prefix occurring in T
for every pattern P_j, denoted P_j[1..ℓ_j]. Recall that after processing a block T_i we obtain, for each pattern
P_j, a node such that the corresponding substring is equal to P_j provided that P_j occurs in T_i. Now we
need a stronger property, which is that for any length k, ℓ_j ≤ k ≤ |P_j|, the ancestor at string depth k of
that node (if any) corresponds to P_j[1..k] provided that P_j[1..k] occurs in T_i. This can be guaranteed by
modifying the procedure described in Lemma 2: if a child of v_1 has no corresponding child of v_2, we report
v_2 as the twin of all nodes in the subtree rooted at that child of v_1. (Notice that now the string depth of
v_1 ∈ 𝒯_1 might be larger than the string depth of its twin v_2 ∈ 𝒯_2, but we generate exactly one twin for every
v_1 ∈ 𝒯_1.) Using the stronger property we can update every ℓ_j by first checking if P_j[1..ℓ_j] occurs in T_i, and if
so incrementing ℓ_j as long as possible. In more detail, let T_i[b..e] denote the substring corresponding to the
twin of P_j. If |T_i[b..e]| < ℓ_j, there is nothing to do. Otherwise, we check whether T_i[b..(b + ℓ_j − 1)] = P_j[1..ℓ_j]
using fingerprints, and if so start to naively compare T_i[b + ℓ_j..e] and P_j[ℓ_j + 1..|P_j|]. Because in the end
Σ_j ℓ_j ≤ m, updating every ℓ_j takes O(s n/ℓ + m) additional total time.

Theorem 13. Given a text T of length n and patterns P_1, ..., P_s of total length m, using O(n log ℓ + s n/ℓ + m)
total time and O(s + ℓ) space we can compute the longest prefix occurring in T for every pattern P_j of length
at most ℓ.


Long patterns. As in Sect. 4, we again distribute all patterns of length at least ℓ into groups. However,
now for patterns in the i-th group (satisfying ⌊log_{4/3} |P_j|⌋ = i) we set g_i = ⌈(4/3)^i⌉ and additionally require
that P_j[1..g_i] occurs in T. To verify that this condition is true, we process the groups in the decreasing order
of the index i and apply Thm. 1 to prefixes P_j[1..g_i]. If for some pattern P_j the prefix fails to occur in T,
we replace P_j setting P_j := P_j[1..g_i − 1]. Observe that this operation moves P_j to a group with a smaller
index (or makes P_j a short pattern). Additionally, note that in subsequent steps the length of P_j decreases
geometrically, so the total length of patterns for which we apply Thm. 1 is O(m) and thus the total running
time of this preprocessing phase is O(n log n + m) as long as ℓ ≥ log s. Hence, from now on we consider only
patterns P_j belonging to the i-th group, i.e., such that g_i ≤ |P_j| < (4/3) g_i and P_j[1..g_i] occurs in T.

As before, we classify patterns depending on their periodicity. However, now the situation is more
complex, because we cannot reverse the text and the patterns. As a warm-up, we first describe how to
process patterns P_j with a non-highly periodic prefix α_j = P_j[1..ℓ]. While not used in the final solution,
this step allows us to gradually introduce all the required modifications. Then we show how to process all highly
periodic patterns, and finally move to the general case, where patterns are not highly periodic.

Patterns with a non-highly periodic prefix. We maintain a tentative longest prefix occurring in T
for every pattern P_j, denoted P_j[1..ℓ_j] and initialized with ℓ_j = ℓ, and proceed as in Algorithm 1 with
the following modifications. In line 4, the new request is (i + ℓ_j − ℓ + 1, j). In line 6, we compare h with
Φ(P_j[(ℓ_j + 2 − ℓ)..(ℓ_j + 1)]). If these two fingerprints are equal, we have found an occurrence of P_j[1..ℓ_j + 1]. In
such case we try to further extend the occurrence by naively comparing P_j[ℓ_j + 1..|P_j|] with the corresponding
fragment of T and incrementing ℓ_j as long as the corresponding characters match. For every P_j we also need
to maintain Φ(P_j[(ℓ_j + 2 − ℓ)..(ℓ_j + 1)]), which can be first initialized in O(ℓ) time and then updated in O(1)
time whenever ℓ_j is incremented. Because at any time at most one request is pending for every pattern P_j
(and thus, while updating ℓ_j no such request is pending), this modified algorithm correctly determines the
longest occurring prefix for every pattern with non-highly periodic α_j.

Lemma 14. Let ℓ be an arbitrary integer. Suppose we are given a text T of length n and patterns P_1, ..., P_s
such that, for 1 ≤ j ≤ s, we have ℓ ≤ |P_j| < (4/3)ℓ and α_j = P_j[1..ℓ] is not highly periodic. We can compute
the longest prefix occurring in T for every pattern P_j using O(n + s(1 + n/ℓ) log s + sℓ) total time and O(s)
space.

Highly periodic patterns. As in Sect. 4.2, we observe that all shiftable occurrences of the longest prefix
of P_j occurring in T can be ignored, and therefore it is enough to consider only non-shiftable occurrences of
α_j (by the same argument, because that longest prefix is of length at least |α_j|). Therefore, we can again
use Algorithm 1 with the same modifications. As for non-highly periodic α_j, we maintain a tentative longest
prefix P_j[1..ℓ_j] for every pattern P_j. Whenever a non-shiftable occurrence of α_j is detected, we create a
new request to check if ℓ_j can be incremented. If so, we start to naively compare P_j[1..ℓ_j + 1] with the
corresponding fragment of T. The total time and space complexity remain unchanged.

Lemma 15. Let ℓ be an arbitrary integer. Given a text T of length n and a collection of highly periodic
patterns P_1, ..., P_s such that, for 1 ≤ j ≤ s, we have ℓ ≤ |P_j| < (4/3)ℓ, we can compute the longest prefix
occurring in T for every pattern P_j using O(n + s(1 + n/ℓ) log s + sℓ) total time and O(s) space.

General case. Now we describe how to process all non-highly periodic patterns P_j. This will be an
extension of the simple modification described for the case of a non-highly periodic prefix α_j. We start with
the following simple combinatorial fact.

Fact 16. Let ℓ ≥ 3 be an integer and w be a non-highly periodic word of length at least ℓ. Then there exists i
such that w[i..i + ℓ - 1] is not highly periodic and either i = 1 or w[1..i + ℓ - 2] is highly periodic.
Furthermore, such i can be found in O(|w|) time and constant space assuming read-only random access to w.

Proof. If per(w[1..ℓ]) > (1/3)ℓ, we are done. Otherwise, choose the largest j such that per(w[1..j]) = per(w[1..ℓ]).
Since w is not highly periodic, we have j < |w|. Thus per(w[1..j]) ≤ (1/3)ℓ but per(w[1..j + 1]) > (1/3)ℓ. We claim
that i = j + 2 - ℓ can be returned. We must argue that per(w[j + 2 - ℓ..j + 1]) > (1/3)ℓ. Otherwise, the periods
of both w[1..j] and w[j + 2 - ℓ..j + 1] are at most (1/3)ℓ. But these two substrings share a fragment of length
ℓ - 1 ≥ (2/3)ℓ, so by the periodicity lemma their periods are in fact the same, and then the whole w[1..j + 1]
has period at most (1/3)ℓ, which is a contradiction.

Regarding the implementation, we compute per(w[1..ℓ]) using Lemma [?]. Then we check how far the
period of w[1..ℓ] extends in the whole w naively in O(|w|) time and constant space. □
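
The procedure from the proof can be sketched as follows. This is an illustrative sketch under two
assumptions: a word is treated as highly periodic when its period is at most one third of its length, and
the period is computed by a naive quadratic routine (the hypothetical helper period) instead of the
constant-space linear-time lemma used in the paper. Positions are 1-based, as in the text.

```python
def period(w):
    """Smallest p >= 1 such that w[k] == w[k + p] for all valid k (naive)."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[k] == w[k + p] for k in range(n - p)):
            return p
    return n


def find_window(w, ell):
    """Return a 1-based i as in Fact 16 for a non-highly periodic word w."""
    p = period(w[:ell])
    if 3 * p > ell:
        return 1  # the prefix w[1..ell] is already not highly periodic
    # Extend the period p of w[1..ell] to the right as far as possible.
    j = ell
    while j < len(w) and w[j] == w[j - p]:
        j += 1
    if j == len(w):
        raise ValueError("w is highly periodic, Fact 16 does not apply")
    # Now p is a period of w[1..j] but not of w[1..j + 1].
    return j + 2 - ell


i = find_window("aaaaaab", 3)
print(i, "aaaaaab"[i - 1:i - 1 + 3])
```

For w = aaaaaab and ℓ = 3 the returned window is w[5..7] = aab, whose period is 3, while w[1..6] = aaaaaa is
highly periodic.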

For every non-highly periodic pattern P_j we use Fact 16 to find its non-highly periodic substring of length
ℓ, denoted P_j[k_j..k_j + ℓ - 1], such that k_j = 1 or P_j[1..k_j + ℓ - 2] is highly periodic. We begin with checking
if P_j[1..k_j + ℓ - 1] occurs in T using Lemma [?] (if k_j = 1, it surely does because of how we partition the
patterns into groups). If not, we replace P_j, setting P_j := P_j[1..k_j + ℓ - 2], which is highly periodic and can
be processed as already described. From now on we consider only patterns P_j such that P_j[k_j..k_j + ℓ - 1] is
not highly periodic and P_j[1..k_j + ℓ - 1] occurs in T.

We further modify Algorithm 1 to obtain Algorithm 2 as follows. We scan the text T with a sliding
window of length ℓ while maintaining a tentative longest occurring prefix P_j[1..ℓ_j] for every pattern P_j,
initialized by setting ℓ_j = k_j + ℓ - 1. Whenever we encounter a substring equal to P_j[k_j..k_j + ℓ - 1], i.e.,
P_j[k_j..k_j + ℓ - 1] = T[i..i + ℓ - 1], we want to check if P_j[1..k_j + ℓ - 1] = T[i - k_j + 1..i + ℓ - 1] by comparing
fingerprints of α_j = P_j[1..ℓ] and the corresponding fragment of T. This is not trivial, as that fragment is
already to the left of the current window. Hence we conceptually move two sliding windows of length ℓ,
corresponding to T[i + (1/3)ℓ..i + (4/3)ℓ - 1] and T[i..i + ℓ - 1], respectively. Because k_j ≤ |P_j| - ℓ < (1/3)ℓ,
whenever the first window generates a request (called a request of type I), the second one is still far enough
to the left for the request to be processed in the future. Furthermore, because P_j[k_j..k_j + ℓ - 1] is
non-highly periodic and the distance between the sliding windows is (1/3)ℓ, each pattern P_j contributes at most
one pending request of type I at any moment and O(1 + n/ℓ) such requests in total. Then, whenever a request of
type I is successfully processed, we know that P_j[1..k_j + ℓ - 1] matches the corresponding fragment of T. We
want to check if the occurrence of P_j[1..k_j + ℓ - 1] can be extended to an occurrence of P_j[1..ℓ_j + 1]. To this
end, we create another request (called a request of type II) to check if P_j[ℓ_j + 2 - ℓ..ℓ_j + 1] matches the
corresponding fragment of T. This request can be processed using the second window and, again because
per(P_j[1..k_j + ℓ - 1]) > (1/3)ℓ, each pattern contributes at most one pending request of type II at any moment
and O(1 + n/ℓ) such requests in total. Finally, whenever a request of type II is successfully processed, we know
that the corresponding ℓ_j can be incremented. Therefore, we start to naively compare the characters of
P_j[ℓ_j + 1..|P_j|] and the corresponding fragment of T. Since at that time no other request of type II is
pending for P_j, the modified algorithm correctly computes all values ℓ_j (and the corresponding positions o_j).

Algorithm 2: Processing patterns with non-highly periodic P_j[k_j..k_j + ℓ - 1]

    for j = 1 to s do
        ℓ_j := k_j + ℓ - 1
    for i = 1 - (1/3)ℓ to n - ℓ + 1 do
        h_1 := Φ(T[i + (1/3)ℓ..i + (4/3)ℓ - 1])
        h_2 := Φ(T[i..i + ℓ - 1])
        foreach j : Φ(P_j[k_j..k_j + ℓ - 1]) = h_1 do
            add a request (i + (1/3)ℓ - k_j + 1, j) to Q_1
        foreach request (i, j) ∈ Q_1 at position i do
            if h_2 = Φ(P_j[1..ℓ]) then
                add a request (i - ℓ + ℓ_j + 1, j) to Q_2
            remove (i, j) from Q_1
        foreach request (i, j) ∈ Q_2 at position i do
            if h_2 = Φ(P_j[ℓ_j + 2 - ℓ..ℓ_j + 1]) then
                o_j := i + ℓ - ℓ_j - 1
                increment ℓ_j as long as P_j[ℓ_j + 1] = T[o_j + ℓ_j]
            remove (i, j) from Q_2

Lemma 17. Let ℓ be an arbitrary integer. Suppose we are given a text T of length n and patterns P_1, ..., P_s
such that, for 1 ≤ j ≤ s, we have ℓ ≤ |P_j| < (4/3)ℓ and P_j is non-highly periodic. We can compute the longest
prefix occurring in T for every pattern P_j in O(n + s(1 + n/ℓ) log s + sℓ) total time using O(s) space.

By combining all the ingredients, we get the following theorem.

Theorem 18. Given a text T of length n and patterns P_1, ..., P_s of total length m, we can compute the
longest prefix occurring in T for every pattern P_j using O(n log n + m) total time and O(s) space.

Proof. We proceed as in Thm. 11 and 12, except that now we use Thm. [?], Lemma 15 and Lemma 17 instead
of Thm. [?], Lemma [?] and Lemma [?], respectively. Additionally, we need O(n log n + m) time to distribute
the long patterns into groups, which is absorbed in the final complexity. □

Finally, let us note that it is straightforward to modify the algorithm so that we can specify for every
pattern P_j an upper bound r_j on the starting positions of the occurrences.

Theorem 19. Given a text T of length n, patterns P_1, ..., P_s of total length m, and integers r_1, ..., r_s,
we can compute for each pattern the maximum length ℓ_j and a position o_j ≤ r_j such that P_j[1..ℓ_j] =
T[o_j..(o_j + ℓ_j - 1)], using O(n log n + m) total time and O(s) space.
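
To make the output of Theorem 19 concrete, the following brute-force reference computes the same pair
(ℓ_j, o_j) for a single pattern. It is only a specification sketch running in quadratic time, not the
small-space algorithm of the theorem; the function name longest_prefix_occurrence and the convention of
returning None as the position when no character matches are assumptions made here.

```python
def longest_prefix_occurrence(T, P, r):
    """Longest prefix of P occurring in T at a 1-based position o <= r.

    Returns (length, o), preferring the leftmost o among the longest matches,
    or (0, None) if not even the first character of P matches.
    """
    best_len, best_pos = 0, None
    for o in range(1, min(r, len(T)) + 1):
        k = 0
        while k < len(P) and o - 1 + k < len(T) and P[k] == T[o - 1 + k]:
            k += 1
        if k > best_len:
            best_len, best_pos = k, o
    return best_len, best_pos


print(longest_prefix_occurrence("abracadabra", "abrax", 11))
print(longest_prefix_occurrence("abracadabra", "cadz", 5))
```

Such a reference is mainly useful for testing a fingerprint-based implementation on small inputs.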

6 Las Vegas Algorithms 

As shown below, it is not difficult to modify our dictionary matching algorithm so that it always verifies the
correctness of the answers. Assuming that we are interested in finding just the leftmost occurrence of every
pattern, we obtain an O(n log n + m)-time Las Vegas algorithm (with inverse-polynomial failure probability).

In most cases, it suffices to naively verify in O(|P_j|) time whether the leftmost occurrence of P_j detected
by the algorithm is valid. If it is not, we are guaranteed that the fingerprints Φ admit a false-positive. Since
this event happens with inverse-polynomial probability, a failure can be reported.

This simple solution remains valid for short patterns and non-highly periodic long patterns. For highly
periodic patterns, the situation is more complicated. The algorithm from Lemma [?] assumes that we are
correctly detecting all occurrences of every α_j so that we can filter out the shiftable ones. Verifying these
occurrences naively might take too much time, because it is not enough to check just one occurrence of every α_j.

Recall that per(α_j) ≤ (1/3)ℓ. If the previous occurrence of α_j was at position i' ≥ i - [?], we will check if
per(α_j) is a period of T[i'..i + ℓ - 1]. If so, either both occurrences (at position i' and at position i) are
false-positives, or none of them is, and the occurrence at position i' can be ignored. Otherwise, at least one
occurrence is surely false-positive, and we declare a failure. To check if per(α_j) is a period of T[i'..i + ℓ - 1],
we partition T into overlapping blocks T_1, T_2, ..., where T_t = T[(t - 1)(1/3)ℓ + 1..(t + 1)(1/3)ℓ]. Let T_t
be the rightmost such block fully inside T[1..i + ℓ - 1]. We calculate the period of T_t using Lemma [?] in O(ℓ)
time and O(1) space, and then calculate how far the period extends to the left and to the right, terminating
if it extends very far. Formally, we calculate the largest e ≤ (t + 2)(1/3)ℓ and the smallest
b ≥ (t - 2)(1/3)ℓ - [?]ℓ such that per(T_t) is a period of T[b..e]. This takes O(ℓ) time for every t, summing up
to O(n) total time. Then, to check if per(α_j) is a period of T[i'..i + ℓ - 1], we check if it divides per(T_t) and
furthermore e ≥ i + ℓ - 1 and b ≤ i'. Finally, we naively verify the reported leftmost occurrences of P_j.
Consequently, Las Vegas randomization suffices in Theorem 12.

For Thm. 18, we run the Las Vegas version of Thm. 12 with P_j[1..ℓ_j] and P_j[1..ℓ_j + 1] as patterns to
make sure that the former occur in T but the latter do not. For Thm. 19, we also use Thm. 12, but this time
we need to see where the reported leftmost occurrences start compared to the bounds r_j.


7 Approximating LZ77 in Small Space 

A non-empty fragment T[i..j] is called a previous fragment if the corresponding subword occurs in T at a
position i' < i. A phrase is either a previous fragment or a single letter not occurring before in T. The
LZ77-factorization of a text T[1..n] is a greedy factorization of T into z phrases, T = f_1 f_2 ... f_z, such that
each f_i is as long as possible. To formalize the concept of LZ77-approximation, we first make the following
definition.

Definition 20. Let w = g_1 g_2 ... g_a be a factorization of w into a phrases. We call it c-optimal if the fragment
corresponding to the concatenation of any c consecutive phrases g_i ... g_{i+c-1} is not a previous fragment.

A c-optimal factorization approximates the LZ77-factorization in the number of factors, as the following
observation states. However, the stronger property of c-optimality is itself useful in certain situations.

Observation 21. If w = g_1 g_2 ... g_a is a c-optimal factorization of w into a phrases, and the
LZ77-factorization of w consists of z phrases, then a ≤ c · z.
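
The definitions above can be checked on small inputs with a brute-force sketch. The code below is only
an illustration under stated assumptions: phrases are represented as 1-based inclusive pairs (i, j), the
helper names is_previous_fragment, lz77 and is_c_optimal are hypothetical, and no attempt is made at the
time or space bounds discussed in this section.

```python
def is_previous_fragment(T, i, j):
    """True if T[i..j] (1-based, inclusive) also occurs at a position i' < i."""
    # An occurrence starting before i must end before j, so it lies in T[1..j-1].
    return T.find(T[i - 1:j], 0, j - 1) != -1


def lz77(T):
    """Greedy LZ77-factorization as a list of 1-based inclusive phrases."""
    phrases, i, n = [], 1, len(T)
    while i <= n:
        j = i
        if is_previous_fragment(T, i, i):
            while j < n and is_previous_fragment(T, i, j + 1):
                j += 1
        phrases.append((i, j))
        i = j + 1
    return phrases


def is_c_optimal(T, phrases, c):
    """Check that no c consecutive phrases concatenate to a previous fragment."""
    for t in range(len(phrases) - c + 1):
        if is_previous_fragment(T, phrases[t][0], phrases[t + c - 1][1]):
            return False
    return True


T = "abababab"
print(lz77(T))
print(is_c_optimal(T, lz77(T), 2))
```

On T = abababab the greedy parse has z = 3 phrases and is 2-optimal, whereas the factorization into single
letters is not, since the pair of letters at positions 3 and 4 forms a previous fragment.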

We first describe how to use the dictionary matching algorithm described in Thm. [?] to produce a
2-optimal factorization of w[1..n] in O(n log n) time and O(z) working space. Then, the resulting parse can
be further refined to produce a (1 + ε)-optimal factorization in O(ε^{-1} n log n) additional time and the same
space using the extension of the dictionary matching algorithm from Thm. 18.

7.1 2-Approximation Algorithm 

7.1.1 Outline. 

Our algorithm is divided into three phases, each of which refines the factorization from the previous phase:
Phase 1. Create a factorization of T[1..n] stored implicitly as z chains consisting of O(log n) phrases each.
Phase 2. Try to merge phrases within the chains to produce an O(1)-optimal factorization.
Phase 3. Try to merge adjacent factors as long as possible to produce the final 2-optimal factorization.

Figure 1: An illustration of Phase 1 of the algorithm, with the “cherries” depicted in thicker lines. The
horizontal lines represent the LZ77-factorization and the vertical lines depict factors induced by the tree.
Longer separators are drawn between chains, whose lengths are written in binary with the least significant
bits on top.


Every phase takes O(n log n) time and uses O(z) working space. In the end, we get a 2-approximation of the
LZ77-factorization. Phases 1 and 2 use the very simple multiple pattern matching algorithm for patterns of
equal lengths developed in Thm. [?], while Phase 3 requires the general multiple pattern matching algorithm
obtained in Thm. 12.

7.1.2 Phase 1. 

To construct the factorization, we imagine creating a binary tree on top of the text T of length n = 2^k; see
also Fig. 1 (we implicitly pad T with sufficiently many $'s to make its length a power of 2). The algorithm works
in log n rounds, and the i-th round works on level i of the tree, starting at i = 1 (the children of the root).
On level i, the tree divides T into 2^i blocks of size n/2^i; the aim is to identify previous fragments among
these blocks and declare them as phrases. (In the beginning, no phrases exist, so all blocks are unfactored.)
To find out if a block is a previous fragment, we use Thm. [?] and test whether the leftmost occurrence of
the corresponding subword is the block itself. The exploration of the tree is naturally terminated at the
nodes corresponding to the previous fragments (or single letters not occurring before), forming the leaves of a
(conceptual) binary tree. A pair of leaves sharing the same parent is called a cherry. The block corresponding
to the common parent is induced by the cherry. To analyze the algorithm, we make the following observation:

Fact 22. A block induced by a cherry is never a previous fragment. Therefore, the number of cherries is at 
most z. 

Proof. The former part follows from construction. To prove the latter, observe that the blocks induced by 
different cherries are disjoint and hence each cherry can be assigned a unique LZ77-factor ending within the 
block. □ 

Consequently, while processing level i of the tree, we can afford storing all cherries generated so far on
a sorted linked list L. The remaining already generated phrases are not explicitly stored. In addition, we
also store a sorted linked list L_i of all still unfactored nodes on the current level i (those for which the
corresponding blocks are tested as previous fragments). Their number is bounded by z (because there is
a cherry below every node on the list), so the total space is O(z). Maintaining both lists sorted is easily
accomplished by scanning them in parallel with each scan of T, and inserting new cherries/unfactored nodes
at their correct places. Furthermore, in the i-th round we apply Thm. [?] to at most 2^i patterns of length
n/2^i, so the total time is Σ_{i=1}^{log n} O(n + 2^i log(2^i)) = O(n log n).
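
The level-by-level exploration of Phase 1 can be sketched as follows. This is a simplified illustration,
not the small-space procedure: it tests previous fragments by brute force instead of using fingerprints,
and it stores all phrases explicitly rather than only the cherries. The names is_previous_fragment and
phase1_phrases are hypothetical, and the input length is assumed to be a power of two already.

```python
def is_previous_fragment(T, i, j):
    """True if T[i..j] (1-based, inclusive) also occurs at a position i' < i."""
    return T.find(T[i - 1:j], 0, j - 1) != -1


def phase1_phrases(T):
    """Phrases produced by halving unfactored blocks level by level."""
    n = len(T)
    assert n >= 1 and n & (n - 1) == 0, "length must be a power of two"
    if n == 1:
        return [(1, 1)]
    phrases = []
    unfactored = [(1, n)]  # the root block is never a previous fragment
    while unfactored:
        next_level = []
        for (i, j) in unfactored:
            mid = (i + j) // 2
            for (a, b) in ((i, mid), (mid + 1, j)):
                # Single letters and previous fragments become leaves (phrases).
                if a == b or is_previous_fragment(T, a, b):
                    phrases.append((a, b))
                else:
                    next_level.append((a, b))
        unfactored = next_level
    return sorted(phrases)


print(phase1_phrases("abababab"))
```

For T = abababab the result consists of the phrases T[1..1], T[2..2], T[3..4] and T[5..8]; the first two
leaves share a parent and therefore form the only cherry.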

Next, we analyze the structure of the resulting factorization. Let h_{x-1} h_x and h_y h_{y+1} be two
consecutive cherries. The phrases h_{x+1} ... h_{y-1} correspond to the right siblings of the ancestors of h_x and to
the left siblings of the ancestors of h_y (no further than to the lowest common ancestor of h_x and h_y). This
naturally partitions h_x h_{x+1} ... h_{y-1} h_y into two parts, called an increasing chain and a decreasing chain to
depict the behaviour of phrase lengths within each part. Observe that these lengths are powers of two, so the
structure of a chain of either type is determined by the total length of its phrases, which can be interpreted
as a bitvector with bit i set to 1 if there is a phrase of length 2^i in the chain. Those bitvectors can be
created while traversing the tree level by level, passing the partially created bitvectors down to the next level
until finally storing them at the cherries in L.
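
Decoding a chain from its total length is a matter of reading the binary expansion. A minimal sketch
(the function name chain_phrase_lengths is an assumption made for illustration):

```python
def chain_phrase_lengths(total, increasing=True):
    """Phrase lengths of a chain whose phrases have total length `total`.

    Bit b of `total` is set exactly when the chain contains a phrase of
    length 2**b, so the set bits list the phrase lengths.
    """
    lengths = [1 << b for b in range(total.bit_length()) if total >> b & 1]
    return lengths if increasing else lengths[::-1]


print(chain_phrase_lengths(13))
print(chain_phrase_lengths(13, increasing=False))
```

A chain of total length 13 = 1101 in binary thus consists of phrases of lengths 1, 4 and 8, in this
order for an increasing chain and in the reverse order for a decreasing one.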

At the end we obtain a sequence of chains of alternating types, see Fig. 1. Since the structure of each
chain follows from its length, we store the sequence of chains rather than the actual factorization, which might
consist of Θ(z log n) = ω(z) phrases. By Fact 22, our representation uses O(z) words of space, and the last
phrase of a decreasing chain concatenated with the first phrase of the consecutive increasing chain never
forms a previous fragment (these phrases form the block induced by the cherry).


7.1.3 Phase 2. 


In this phase we merge phrases within the chains. We describe how to process increasing chains;
decreasing chains are handled analogously, mutatis mutandis. We partition the phrases h_ℓ ... h_r within a chain
into groups.

For each chain we maintain an active group, initially consisting of h_ℓ, and scan the remaining phrases
in the left-to-right order. We either append a phrase h_i to the active group g_j, or we output g_j and make
g_{j+1} = h_i the new active group. The former action is performed if and only if the fragment of length 2|h_i|
starting at the same position as g_j is a previous fragment. Having processed the whole chain, we also output
the last active group.
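
The grouping rule for a single increasing chain can be sketched as follows. This is an illustration
under assumptions: previous fragments are tested by brute force rather than with fingerprints, a test
fragment that would run past the end of the text is treated as not being a previous fragment, and the
names is_previous_fragment and group_increasing_chain are hypothetical. Positions are 1-based.

```python
def is_previous_fragment(T, i, j):
    """True if T[i..j] (1-based, inclusive) also occurs at a position i' < i."""
    return T.find(T[i - 1:j], 0, j - 1) != -1


def group_increasing_chain(T, start, lengths):
    """Group the phrases of an increasing chain starting at position `start`.

    `lengths` are the phrase lengths (increasing powers of two).
    Returns the groups as 1-based inclusive pairs.
    """
    groups = []
    g_start, g_len = start, lengths[0]
    for h in lengths[1:]:
        end = g_start + 2 * h - 1  # fragment of length 2|h_i| starting at g_j
        if end <= len(T) and is_previous_fragment(T, g_start, end):
            g_len += h  # append h_i to the active group
        else:
            groups.append((g_start, g_start + g_len - 1))  # output g_j
            g_start += g_len
            g_len = h  # h_i becomes the new active group
    groups.append((g_start, g_start + g_len - 1))
    return groups


print(group_increasing_chain("aaaaaaaa", 2, [1, 2, 4]))
```

With T = aaaaaaaa and a chain of phrase lengths 1, 2, 4 starting at position 2, the first two phrases are
merged into T[2..4] and the last phrase T[5..8] remains a group of its own.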


Fact 23. Within every chain every group g_j forms a valid phrase, but no concatenation of three adjacent
groups g_j g_{j+1} g_{j+2} forms a previous fragment.

Proof. Since the lengths of phrases form an increasing sequence of powers of two, at the moment we need to
decide if we append h_i to g_j we have |g_j| ≤ |h_ℓ ... h_{i-1}| < |h_i|, so 2|h_i| > |g_j h_i|, and thus we are
guaranteed that if we append h_i to g_j, then g_j h_i is a previous fragment. Finally, let us prove the
aforementioned optimality condition, i.e., that g_j g_{j+1} g_{j+2} is not a previous fragment for any three
consecutive groups. Suppose that we output g_j while processing h_i, that is, g_{j+1} = h_i ... h_{i'}. We did not
append h_i to g_j, so the fragment of length 2|h_i| starting at the same position as g_j is not a previous
fragment. However, |g_j g_{j+1} g_{j+2}| > |g_{j+1} g_{j+2}| ≥ |h_i h_{i+1}| > 2|h_i|, so this immediately implies
that g_j g_{j+1} g_{j+2} is not a previous fragment. □


The procedure described above is executed in parallel for all chains, each of which maintains just the
length of its active group. In the i-th round only chains containing a phrase of length 2^i participate (we
use bit operations to verify which chains have length containing 2^i in the binary expansion). These chains
provide fragments of length 2^{i+1} and Thm. [?] is applied to decide which of them are previous fragments. The
chains modify their active groups based on the answers; some of them may output their old active groups.
These groups form phrases of the output factorization, so the space required to store them is amortized by the
size of this factorization. As far as the running time is concerned, we observe that no more than min(z, n/2^i)
chains participate in the i-th round. Thus, the total running time is Σ_i O(n + (n/2^i) log(n/2^i)) = O(n log n).

To bound the overall approximation guarantee, suppose there are five consecutive output phrases forming a
previous fragment. By Fact 22, these fragments cannot contain a block induced by any cherry. Thus, the
phrases are contained within two chains. However, by Fact 23, no three consecutive phrases obtained from a
single chain form a previous fragment. Hence the resulting factorization is 5-optimal.


7.1.4 Phase 3. 

The following lemma achieves the final 2-approximation: 

Lemma 24. Given a c-optimal factorization, one can compute a 2-optimal factorization using O(c · n log n)
time and O(c · z) space.

Proof. The procedure consists of c iterations. In every iteration we first detect previous fragments
corresponding to concatenations of two adjacent phrases. The total length of the patterns is up to 2n, so this
takes O(n log n + m) = O(n log n) time and O(c · z) space using Thm. 12. Next, we scan through the factorization
and merge every phrase g_i with the preceding phrase g_{i-1} if g_{i-1} g_i is a previous fragment and g_{i-1} has
not been just merged with its predecessor.

We shall prove that the resulting factorization is 2-optimal. Consider a pair of adjacent phrases g_{i-1} g_i
in the final factorization and let j be the starting position of g_i. Suppose g_{i-1} g_i is a previous fragment.
Our algorithm performs merges only, so the phrase ending at position j - 1 concatenated with the phrase
starting at position j formed a previous fragment at every iteration. The only reason that these factors
were not merged could be another merge of the former factor. Consequently, the factor ending at position
j - 1 took part in a merge at every iteration, i.e., g_{i-1} is a concatenation of at least c phrases of the input
factorization. However, all the phrases created by the algorithm form previous fragments, which contradicts
the c-optimality of the input factorization. □
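
The iterations from the proof of Lemma 24 can be sketched as follows. The sketch replaces the
fingerprint-based detection of previous fragments by a brute-force test, so it illustrates only the
merging rule and not the stated time and space bounds; the names is_previous_fragment, merge_pass and
refine_to_2_optimal are hypothetical. Phrases are 1-based inclusive pairs.

```python
def is_previous_fragment(T, i, j):
    """True if T[i..j] (1-based, inclusive) also occurs at a position i' < i."""
    return T.find(T[i - 1:j], 0, j - 1) != -1


def merge_pass(T, phrases):
    """One iteration: merge g_i into g_{i-1} unless g_{i-1} was just merged."""
    out = []
    just_merged = False  # was out[-1] created by a merge in this pass?
    for (i, j) in phrases:
        if out and not just_merged and is_previous_fragment(T, out[-1][0], j):
            out[-1] = (out[-1][0], j)
            just_merged = True
        else:
            out.append((i, j))
            just_merged = False
    return out


def refine_to_2_optimal(T, phrases, c):
    """Run c merging iterations on a c-optimal factorization."""
    for _ in range(c):
        phrases = merge_pass(T, phrases)
    return phrases


T = "aaaaaaaa"
singles = [(k, k) for k in range(1, len(T) + 1)]  # an 8-optimal factorization
print(refine_to_2_optimal(T, singles, 8))
```

Starting from the single-letter factorization of aaaaaaaa, the passes produce 5, then 3, then 2 phrases,
and the remaining iterations leave the factorization T[1..1], T[2..8] unchanged.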


7.2 Approximation Scheme 

The starting point is a 2-optimal factorization into a phrases, which can be found in O(n log n) time using
the previous method. The text is partitioned into (ε/2)a blocks corresponding to 2/ε consecutive phrases. Every
block is then greedily factorized into phrases. The factorization is implemented using Thm. [?] to compute
the longest previous fragments in parallel for all blocks as follows. Denote the starting positions of the blocks
by b_1 < b_2 < ... and let i_j be the current position in the j-th block, initially set to b_j. For every j, we find
the longest previous fragment starting at i_j and fully contained inside T[i_j..b_{j+1} - 1] by computing the longest
prefix of T[i_j..b_{j+1} - 1] occurring in T and starting in T[1..i_j - 1], denoted T[i_j..i_j + ℓ_j - 1]. Then we output
every T[i_j..i_j + ℓ_j - 1] as a new phrase and increase i_j by ℓ_j. Because every block, by definition, can be
factorized into 2/ε phrases and the greedy factorization is optimal, this requires O((1/ε) n log n) time in total.
To bound the approximation guarantee, observe that every phrase inside the block, except possibly for the last
one, contains an endpoint of a phrase in the LZ77-factorization. Consequently, the total number of phrases
is at most z + (ε/2)a ≤ (1 + ε)z.
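
The block-wise greedy refinement can be sketched as follows. This is an illustrative sketch under
assumptions: previous fragments are found by brute force instead of the parallel fingerprint-based
search, blocks consist of ceil(2/ε) consecutive input phrases (the rounding is a choice made here), and
the names is_previous_fragment and approximation_scheme are hypothetical. Phrases are 1-based inclusive
pairs.

```python
import math


def is_previous_fragment(T, i, j):
    """True if T[i..j] (1-based, inclusive) also occurs at a position i' < i."""
    return T.find(T[i - 1:j], 0, j - 1) != -1


def approximation_scheme(T, phrases, eps):
    """Greedily re-factorize blocks of ceil(2 / eps) consecutive phrases."""
    k = math.ceil(2 / eps)
    out = []
    for t in range(0, len(phrases), k):
        b = phrases[t][0]                             # block start
        e = phrases[min(t + k, len(phrases)) - 1][1]  # block end
        i = b
        while i <= e:
            j = i
            # Longest previous fragment starting at i and staying in the block;
            # a letter not occurring before forms a phrase on its own.
            if is_previous_fragment(T, i, i):
                while j < e and is_previous_fragment(T, i, j + 1):
                    j += 1
            out.append((i, j))
            i = j + 1
    return out


T = "aaaaaaaa"
print(approximation_scheme(T, [(1, 1), (2, 3), (4, 5), (6, 7), (8, 8)], 1.0))
```

Phrases never cross block boundaries, which is why each block can contribute one phrase more than the
corresponding part of the LZ77-factorization.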

Theorem 25. Given a text T of length n whose LZ77-factorization consists of z phrases, we can factorize T
into at most 2z phrases using O(n log n) time and O(z) space. Moreover, for any ε ∈ (0, 1], in O(ε^{-1} n log n)
time and O(z) space we can compute a factorization into no more than (1 + ε)z phrases.